METHOD FOR IDENTIFYING MOTIFS AND/OR COMBINATIONS OF MOTIFS
HAVING A BOOLEAN STATE OF PREDETERMINED MUTATION 
IN A SET OF SEQUENCES AND ITS APPLICATIONS 

Related Application 

[0001] This is a continuation of International Application No. PCT/FR02/02068, with an 

international filing date of June 14, 2002, which is based on French Patent Application No. 
01/07808, filed June 14, 2001. 

Field of the Invention 

[0002] This invention pertains to the field of analysis of sequences of nucleotides and/or 

amino acids composing living organisms, in particular, analysis of particular mutations of the 
sequences. 

[0003] The invention also pertains to methods of identification and selection of fragments 

of sequences of nucleic acids or proteins constituted by and/or comprising motifs having 
characteristics of specific mutability. The invention further pertains to pharmaceutical 
compositions containing the fragments that are useful for treating and/or preventing human, 
animal and/or plant pathologies or are useful for screening therapeutic compounds. 

Background 

[0004] It is known that the mutations induced in the wild sequences of pathogenic 

organisms are responsible, for example, for therapeutic escape mechanisms, i.e., the capacity of 
viral or bacterial pathogenic organisms to resist a therapeutic treatment. The nucleotide and/or 
polypeptide sequences of the mutant strains of the organisms have particular mutations in 
relation to the nucleotide or polypeptide sequences of the wild strains. 



[0005] Such mutations are also determinant of functional changes of the genes or proteins 

which have as a consequence the deterioration of numerous biological processes, such as the 
triggering of the immune response, infectivity of viruses, development of cancers, etc. 

[0006] It is known, for example, that the genetic information of the human immunodeficiency
virus (HIV), which belongs to the retrovirus family, is carried by two RNA molecules. Upon
infection, the viral genome therefore cannot be integrated directly into that of the host cells.
The prior synthesis of a DNA copy from the genomic RNA of the virus is a determinant step of the
infectious cycle. The enzyme responsible for this reverse transcription is a protein called
Reverse Transcriptase (RT). The low reverse-transcriptional accuracy of this protein confers on
the virus a large genomic variability. It is estimated that in an untreated seropositive
individual, one mutation appears per replication and, thus, for the ten billion viruses produced
per day, there would be ten billion new mutations. These mutations can lead to resistance to one
or more antiretroviral agents and, thus, generate strains that are more virulent because they are
increasingly resistant.

[0007] Faced with this problematic situation, practitioners prescribe very intense treatment
regimens such as long-term triple drug combinations and, more recently, even quadruple drug
combinations, and perhaps even more in the future, profiting from the absence of resistant virus
which characterizes in general the patients who have not yet been treated and are infected by a
single form of virus. These treatments then cause a strong diminution of the viral load, which is
considered to be the quantity of viral particles circulating in the blood. The number of viral
mutants, which is directly proportional to the viral load, diminishes as well, thereby reducing
the risks of therapeutic escape.

[0008] These extremely intense treatments are unfortunately accompanied by numerous 

side effects. They moreover require perfect compliance which, if not respected, is accompanied 
almost systematically by the emergence of resistant strains. These selected resistances under the
pressure of antiretroviral agents are at the origin of most of the therapeutic escapes. 

[0009] Thus, although the choice of a combination of antiretroviral agents appears to be
fundamental, the optimal combination of these agents is not obvious. In addition to the multiple
problems posed by the resistances described above, the incompatibility of certain drug
combinations and the constantly increasing number of antiretroviral agents make the
practitioner's work more and more difficult.
[0010] Physicians at present have available about twenty therapeutic agents essentially 

directed against two viral proteins - reverse transcriptase and protease. The most common 
therapeutic regimens involve triple drug combinations. A total of 252 possible combinations 
have been described - based only on the most common combinations. These calculations are 
statistical and do not take into account the different drug incompatibilities. Moreover, the 
appearance of new active ingredients stemming from pharmaceutical research will have the direct 
consequence of further complicating the problem of the selection of the drug combination. 

[0011] The activity of other pathogenic organisms is also of concern: the flu virus was
responsible for 20 million deaths during the 20th century and the Ebola virus emerged in an
alarming manner. The hepatitis A, B, C, D and E viruses constitute veritable public health
priorities both because of their Boolean status and their potential gravity.

[0012] In all of these cases, there is a therapeutic and vaccinal vacuum which increases 

each year because of the great mutability of the viral genomes, especially that of the retroviruses, 
RNA viruses such as HIV, flu, Ebola, hepatitis C, etc. 

[0013] Many approaches have been proposed for attempting to resolve these 

multiresistance problems linked with the high degree of mutability of certain pathogenic 
organisms. The company Virco Tibotech, for example, developed a method directed by a 
computer program that enables comparison of a given genotype with a databank of HIV
sequences. It then defines a list of the possible resistances to the antiretroviral agents. 

[0014] Moreover, certain web sites such as that of the Los Alamos Library
(http://hiv-web.lanl.gov/) provide a large amount of data regarding the alignments of the HIV
protein sequences as well as their mutations.

[0015] Similarly, many publications by Ribeiro et al. disclose methods employing the
calculation of the Boolean status of the appearance of resistant mutants using rather complex
mathematical calculations.

[0016] Thus, methods for identifying the mutations of the constituent motifs of 

nucleotide or polypeptide sequences have been developed, e.g., those that made it possible during 
the 1980s to classify the immunoglobulins into classes and subclasses comprising constant 
domains and variable domains as a function of the variability of motifs of the different sequences 
that comprised them. 

[0017] However, these methods do not enable identification of motifs whose mutation 

possibility is predetermined in relation to the set of sequences analyzed. In the framework of this 
invention, this mutation possibility corresponds to a Boolean state of mutation. 

[0018] It would therefore be advantageous to provide for the identification of multiple
motifs whose Boolean state of relative mutation is predetermined in relation to a set of given
sequences. This method should be based on the identification either of motifs or combinations of
motifs never having mutated simultaneously, or of motifs or combinations of motifs having mutated
simultaneously at least once on at least one sequence of a set and not having mutated on other
sequences of the set.

Summary of the Invention

[0019] This invention relates to a method for identifying a motif or a combination of 

motifs having a Boolean state of predetermined mutations in a set of sequences including a) 
aligning a set of sequences of ordered motifs represented by a single-character code, b) 
comparing a reference sequence with the set of sequences aligned in step (a), c) identifying 
motifs not having mutated simultaneously or motifs having mutated simultaneously at least once 
on at least one sequence of the set and not having mutated on another sequence of said set. 
[0020] This invention also relates to a pharmaceutical composition for treatment of 

influenza, HIV and hepatitis C including a therapeutically effective amount of the motif or 
combination of motifs. 

[0021] This invention further relates to a method of treating influenza, HIV and hepatitis 

C including administering a therapeutically effective amount of the pharmaceutical composition. 

Detailed Description 

[0022] This invention provides a new tool to enable finding more durable solutions 

during therapeutic treatments of pathologies involving pathogenic organisms or human genes 
having a high degree of mutability. 

[0023] The invention also provides for the use of sequences constituted by or comprising 

the motifs and/or combinations of motifs thereby identified treating or preventing human, animal 
or plant pathologies, the preparation of therapeutic targets for the screening of said drugs, the 
docking of a drug on its target, the development of new diagnostic tools in which, for example, 
the selection of one or more therapeutic agents can be performed as a function of the mutability 
of the pathogenic organism responsible for the disease of a given patient. 

[0024] The term "motif" as used herein is understood to mean a nucleotide capable of being
part of a synthetic nucleic acid or oligonucleotide sequence, designated below by its
single-character code: A, G, C, T or U, corresponding to the nomenclature of the respective base
(adenine A, guanine G, cytosine C or thymine T in the DNA, or uracil U in the RNA) of which it is
constituted.

[0025] The term "motif" is also understood to mean an amino acid, irrespective of its
configuration, capable of being part of a natural or synthetic protein or peptide, designated by
its single-character code as shown, for example, in the table below.



Codes of the amino acids

Code    Amino Acid
A       alanine
C       cysteine
D       aspartic acid
E       glutamic acid
F       phenylalanine
G       glycine
H       histidine
I       isoleucine
K       lysine
L       leucine
M       methionine
N       asparagine
P       proline
Q       glutamine
R       arginine
S       serine
T       threonine
V       valine
W       tryptophan
Y       tyrosine



[0026] The term "sequence" is understood to mean any chaining of motifs as defined above
capable of constituting a sequence of a nucleic acid or a fragment thereof of a living organism or
a sequence of a protein or a fragment thereof of a living organism, including wild sequences,
mutant sequences or artificial sequences similar to those obtained by chemical or biological
synthesis according to methods known in the art. As nonlimiting examples, a sequence containing
such motifs can be a group of genes, a gene or a fragment thereof, a group of proteins, a protein
or a fragment thereof.

[0027] The term "variant of a sequence" is understood to mean any sequence differing 

from the original or wild sequence by at least one motif. 

[0028] Thus, the invention identifies motifs that did not mutate simultaneously among all 

of the members of a set of sequences. The identification of such motifs is a major achievement 
among new pharmacological developments both in terms of therapeutic targets as well as at the 
level of the searching for new therapeutic compounds, especially in the framework of resistance 
and multiple-resistances developed by pathogenic organisms which are harmful for both animal 
species as well as plant species. 

[0029] The invention also pertains to the use of these fragments of sequences constituted 

by and/or comprising motifs that did not mutate simultaneously for therapeutic targets that are 
useful for screening drugs as well as for vaccines directed against pathogenic organisms and, in 
particular, against pathogenic organisms having a high degree of mutability. 
[0030] The invention further pertains to the use of sequences constituted by and/or 

comprising motifs that did not mutate simultaneously for compounds useful for preventing and 
treating human and/or animal pathologies, and in particular pathologies the responsible genes of 
which have a high degree of mutability. 

[0031] The use of fragments of particular sequences of the pathogenic organisms 

constituted by and/or comprising the motifs that did not mutate simultaneously as therapeutic 
compounds makes it possible, among other things, to: 

- Decrease the appearance of resistances during therapeutic treatment; 

- Stabilize the health of the patient over the long term by permitting the use of the drugs
available on the market for a longer period of time;

- Avoid the appearance of opportunistic diseases and thereby decrease the overall cost of 
the treatment; 

- Decrease the duration and the cost of investments in research and development in the 
pharmaceutical industry. 

[0032] This invention, thus, provides a new tool for optimizing selection of therapeutic 

treatments directed against pathogenic organisms with a high degree of mutability or against 
pathologies due to the appearance of mutations. 

[0033] One aspect of the method of the invention for identifying motifs comprises comparing
a subset of variants of the same nucleotide or polypeptide sequence of a given pathogenic organism
with a reference sequence, for example, a consensus sequence, and then identifying during this
comparison the motifs of the sequences which did not mutate simultaneously or the motifs which
mutated simultaneously at least once on at least one of the sequences of the subset and did not
mutate on the other sequences of the subset.
[0034] The invention more precisely provides a method for identifying a motif or a 

combination of motifs having a Boolean state of predetermined mutation in a set of sequences, 
comprising: 

a) alignment of sequences of ordered motifs represented by their single-character code, 

b) comparison of a reference sequence with the set of sequences aligned in step (a), 

c) identification of the motifs that did not mutate simultaneously or of the motifs having 
mutated simultaneously at least once on at least one of the sequences of the set and not having 
mutated on the other sequences of the set. 

[0035] According to one embodiment, the motif or the combination of motifs to be
identified is a nucleotide or a combination of nucleotides and the subset of sequences can be
extracted from a databank of nucleic acids.

[0036] According to another embodiment, the motif or the combination of motifs to be 

identified is an amino acid or a combination of amino acids and the subset of sequences can be 
extracted from a databank of polypeptides and/or proteins. 

[0037] The alignment of the sequences can be performed by means of any alignment 

method known in the art. 

[0038] For example, when the number of sequences of the subset that is being used is less
than 100, it is possible to use the alignment method of Clustal W (Thompson, J.D., Higgins,
D.G. and Gibson, T.J. (1994) CLUSTAL W: Improving the sensitivity of progressive multiple
sequence alignment through sequence weighting, position-specific gap penalties and weight
matrix choice. Nucleic Acids Research, 22: 4673-4680).

[0039] If the number of sequences to analyze is larger, e.g., greater than 100, the
alignment proposed by Clustal W takes too long and it is necessary to employ an iterative
alignment based on a hidden Markov model, referred to below as HMM (Sean Eddy, "Hidden Markov
Models", Curr. Opin. Struct. Biol. Vol. 6, pages 361-365, 1996).

[0040] In this latter case, there is created, for example, a first subset of 100 sequences 

extracted from the set of sequences to be analyzed to which is applied the Clustal method to 
obtain a first alignment. 

[0041] A hidden Markov model (HMM) is created from this first alignment. The model may
optionally be calibrated to make it more sensitive. New sequences are then added to the first
alignment and are in turn aligned using the HMM.

[0042] The reference sequence of step (b) is advantageously constituted by a wild
sequence or by a consensus sequence comprising in position i the motif present in position i in a
predetermined number of sequences of step (a), for example, in more than 30% of said sequences
and more preferably in more than 75% of said sequences, with it being possible to adjust these
values according to the case.
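
The consensus rule described above can be illustrated with a minimal Python sketch. It assumes
that the sequences are already aligned to equal length. The function name `consensus` and the
placeholder character used when no motif exceeds the threshold are assumptions made for this
illustration; the text does not specify how such positions are handled.

```python
from collections import Counter


def consensus(aligned, threshold=0.75, unknown="X"):
    """Build a consensus string from equal-length aligned sequences.

    Position i receives the most frequent motif when that motif is present
    in more than `threshold` of the sequences. Otherwise the placeholder
    `unknown` is written (an assumption of this sketch).
    """
    if not aligned:
        raise ValueError("need at least one sequence")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to the same length")

    result = []
    for i in range(length):
        motif, count = Counter(seq[i] for seq in aligned).most_common(1)[0]
        result.append(motif if count / len(aligned) > threshold else unknown)
    return "".join(result)


# Toy alignment of four hypothetical four-motif sequences.
sequences = ["PQIT", "PQIT", "PQVT", "PKIT"]
print(consensus(sequences, threshold=0.30))
print(consensus(sequences, threshold=0.75))
```

With the stricter threshold, positions where the majority motif is present in exactly 75% of the
sequences are not retained, because the text requires "more than" the chosen percentage.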

[0043] Step (b), the comparison of sequences in the identification method of the invention,
advantageously comprises:

- constituting a first numerical matrix $A$ of dimensions $N \times M$, in which $N$ designates
the number of sequences and $M$ designates the number of motifs of one of the sequences of said
alignment, with the value $A_{i,j}$ being equal to a first value $A_1$ [for example, "0"] when the
motif of position $i$ of the sequence $j$ is mutated in relation to the motif of position $i$ of
the reference sequence and equal to a second value $A_2$ [for example, "1"] in the other cases,

- constituting two analysis matrices $B$, $C$ of the mutations, in which the matrices are:

    - a matrix $B$ of unmutated couples, i.e., of couples which did not mutate simultaneously,
    of dimension $M \times M$, the value $B_{i,k} = B_{k,i}$ being equal:

        to a first value $B_1$ [for example, "0"] when $A_{i,j} = A_{k,j} = A_1$ irrespective of
        the value of $j$ ranging from 0 to $N$,

        to a second value $B_2$ [for example, "1"] in the other cases;

    - a matrix $C$ of mutated couples [i.e., of couples that mutate either always, or never
    simultaneously] of dimension $M \times M$, the value $C_{k,i} = C_{i,k}$ being equal:

        to a second value $C_1$ [for example, "1"] when $A_{i,j} = A_{k,j}$ irrespective of the
        value of $j$ ranging from 0 to $N$,

        to a first value $C_2$ [for example, "0"] in the other cases;

- determining for a set $E$ of positions a coefficient $R_E$ whose value is $R_1$ [for example,
"1"] when the values $B_{i,k}$ are equal to the second value $B_2$, irrespective of the values of
$i$ and $k$ belonging to the set $E$ of the positions, in which $i \neq k$,

- determining for a set $F$ of positions a coefficient $R_F$, the value of which is $R_1$ [for
example, "1"] when the values $C_{i,k}$ are equal to the second value $C_1$, irrespective of the
values of $i$ and $k$ belonging to the set $F$ of the positions, in which $i \neq k$.
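
The matrices and coefficients of this step can be sketched in Python as follows. This is an
illustration under stated assumptions rather than the implementation of the invention: the
function names are hypothetical, the matrix A is stored with one row per sequence, and the
condition for the first value of B is read as "at least one sequence in which both positions are
mutated", which is the reading consistent with B describing couples that did not mutate
simultaneously.

```python
def mutation_matrix(aligned, reference):
    """A: one row per sequence, 0 where the motif differs from the reference, else 1."""
    return [[0 if seq[i] != reference[i] else 1 for i in range(len(reference))]
            for seq in aligned]


def unmutated_couples(a):
    """B[i][k] = 0 if some sequence has positions i and k both mutated, else 1.

    Assumption of this sketch: one simultaneous mutation is enough to set 0.
    """
    m = len(a[0])
    return [[0 if any(row[i] == 0 and row[k] == 0 for row in a) else 1
             for k in range(m)] for i in range(m)]


def mutated_couples(a):
    """C[i][k] = 1 if positions i and k have the same state in every sequence, else 0."""
    m = len(a[0])
    return [[1 if all(row[i] == row[k] for row in a) else 0
             for k in range(m)] for i in range(m)]


def coefficient(matrix, positions):
    """R = 1 if matrix[i][k] == 1 for all distinct i, k in `positions`, else 0."""
    return int(all(matrix[i][k] == 1
                   for i in positions for k in positions if i != k))


# Toy data: a hypothetical reference and four aligned variants.
reference = "ABCD"
aligned = ["ABCD", "XBCD", "AYZD", "ABCD"]

A = mutation_matrix(aligned, reference)
B = unmutated_couples(A)
C = mutated_couples(A)

print(coefficient(B, [0, 1, 3]))  # R_E for E = {0, 1, 3}
print(coefficient(C, [1, 2]))     # R_F for F = {1, 2}
```

In the toy data, positions 1 and 2 mutate together in the third variant and nowhere else, so they
fail the test on B and pass the test on C.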

[0044] According to one embodiment, in step (b) of the method, the positions of the sets 

E and/or F are designated by the user. 

[0045] According to another embodiment, step (b) of the method comprises a test step of
generating the totality of the combinations of the possible positions, determining for each of the
combinations the value of the coefficients $R_E$ or $R_F$, and retaining the combination
corresponding to the largest set of positions for which $R_E$ or $R_F$ corresponds to the second
value.
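
A minimal sketch of this exhaustive test step is given below, assuming a symmetric pairwise
matrix of 0/1 values such as the matrix of unmutated couples. The function name
`largest_compatible_set` is hypothetical. The search tries the largest combinations first and
stops at the first one whose pairs all carry the value 1; its cost grows exponentially with the
number of positions, so it is only practical for small sets of candidate positions.

```python
from itertools import combinations


def largest_compatible_set(matrix, candidates=None):
    """Return the largest tuple of positions whose pairs all have value 1.

    `matrix` is assumed symmetric, so each unordered pair is checked once.
    Returns an empty tuple when no pair of positions qualifies.
    """
    positions = list(range(len(matrix))) if candidates is None else list(candidates)
    for size in range(len(positions), 1, -1):
        for combo in combinations(positions, size):
            if all(matrix[i][k] == 1 for i, k in combinations(combo, 2)):
                return combo
    return ()


# Hypothetical 4 x 4 matrix of unmutated couples: positions 1 and 2
# mutated together in at least one sequence, so they cannot share a set.
B = [
    [0, 1, 1, 1],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
]
print(largest_compatible_set(B))
```

When several sets of the same size qualify, this sketch returns the first one in lexicographic
order; the text does not say how ties are to be resolved.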
[0046] The matrix of mutated couples of the invention advantageously makes it possible 

to identify two motifs having mutated simultaneously at least once on at least one of the 
sequences of the set and not having mutated on the other sequences of the set. 
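
As an illustration, the pairs described in this paragraph can be listed directly from an
alignment. The sketch below is written under the assumption that a pair qualifies when its two
positions are mutated together in at least one sequence and are both unmutated in every other
sequence; the function name `co_mutating_pairs` is hypothetical.

```python
def co_mutating_pairs(aligned, reference):
    """List position pairs (i, k), i < k, that mutate together or not at all.

    A pair is kept only if at least one sequence has both positions mutated,
    which excludes pairs of positions that never mutate.
    """
    m = len(reference)
    mutated = [[seq[i] != reference[i] for i in range(m)] for seq in aligned]
    pairs = []
    for i in range(m):
        for k in range(i + 1, m):
            together = any(row[i] and row[k] for row in mutated)
            same_state = all(row[i] == row[k] for row in mutated)
            if together and same_state:
                pairs.append((i, k))
    return pairs


# Toy data: positions 1 and 2 change together in the third variant only.
reference = "ABCD"
aligned = ["ABCD", "XBCD", "AYZD", "ABCD"]
print(co_mutating_pairs(aligned, reference))
```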
[0047] The invention also pertains to a way of performing a comparison of the sequences
containing the motifs and of identifying those motifs that either did not mutate simultaneously or
mutated simultaneously at least once on at least one of the sequences of the set and did not
mutate on the other sequences of the set, comprising:

- constituting a first numerical matrix $A$ of dimensions $N \times M$, in which $N$ designates
the number of sequences and $M$ designates the number of motifs of one of the sequences of the
alignment, the value $A_{i,j}$ being equal to a first value $A_1$ [for example, "0"] when the
motif of position $i$ of the sequence $j$ is mutated in relation to the motif of position $i$ of
the reference sequence and equal to a second value $A_2$ [for example, "1"] in the other cases,

- constituting two analysis matrices $B$, $C$ of the mutations, in which the matrices are:

    - a matrix $B$ of unmutated couples, i.e., couples which did not mutate simultaneously, of
    dimension $M \times M$, the value $B_{i,k} = B_{k,i}$ being equal:

        to a first value $B_1$ [for example, "0"] when $A_{i,j} = A_{k,j} = A_1$ irrespective of
        the value of $j$ ranging from 0 to $N$,

        to a second value $B_2$ [for example, "1"] in the other cases;

    - a matrix $C$ of mutated couples [i.e., couples that mutate either once simultaneously or
    never] of dimension $M \times M$, the value $C_{i,k} = C_{k,i}$ being equal:

        to a second value $C_1$ [for example, "1"] when $A_{i,j} = A_{k,j}$ irrespective of the
        value of $j$ ranging from 0 to $N$,

        to a first value $C_2$ [for example, "0"] in the other cases;

- determining for a set $E$ of positions a coefficient $R_E$, the value of which is $R_1$ [for
example, "1"] when all of the values $B_{i,k}$ are equal to the second value $B_2$, irrespective
of the values of $i$ and $k$ belonging to the set $E$ of said positions, in which $i \neq k$,

- determining for a set $F$ of positions a coefficient $R_F$, the value of which is $R_1$ [for
example, "1"] when all of the values $C_{i,k}$ are equal to the second value $C_1$, irrespective
of the values of $i$ and $k$ belonging to the set $F$ of said positions, in which $i \neq k$.

[0048] The sequences analyzed by the identification preferably comprise a subset of 

sequences extracted from a databank of nucleotide or polypeptide sequences of pathogenic 
organisms and most preferentially by nucleotide or polypeptide sequences of pathogenic 
organisms presenting a high degree of mutability. 

[0049] According to one embodiment, the subset of sequences comprises all the 

polypeptide sequences of the different known variants of the protease of the human 
immunodeficiency virus. 



[0050] According to another embodiment, the subset of sequences comprises all of the 

polypeptide sequences of the different known variants of the reverse transcriptase of the human 
immunodeficiency virus. 

[0051] According to yet another embodiment, the subset of sequences comprises all of 

the polypeptide sequences of the different known variants of the integrase of the human 
immunodeficiency virus. 

[0052] The invention pertains to identifying motifs belonging to pathogenic agents, the 

nucleic acid and/or polypeptide sequences of which are capable of having mutations. 

[0053] Nonlimiting examples of such sequences include the sequences of viruses such as the
hepatitis C virus, which is an RNA virus characterized by the high degree of variability of its
genome, with 3% of world prevalence and 600,000 persons infected in France, the Ebola virus, which
causes hemorrhagic fevers and which is associated with a high mortality rate, the sequences of the
flu virus, for which it is necessary to develop new vaccines each year, or the sequences of other
emerging viruses with a high rate of mutability.

[0054] Thus, according to a particular aspect of the invention, the subset of extracted 

sequences comprises the polypeptide sequences of the different variants of the neuraminidase of 
the flu virus. 

[0055] According to another particular aspect of the invention, the subset of extracted 

sequences comprises all of the polypeptide sequences of the different variants of the 
hemagglutinin of the flu virus. 

[0056] Among the sequences of bacteria capable of having mutations, examples include the
C-terminal sequence of the protein HspA of the bacterium Helicobacter pylori or the HA-type
adhesin of the bacterium Escherichia coli.

[0057] The method for identifying motifs of the invention is not limited solely to the
domain of pathogenic agents. Sets of sequences having motifs which did not mutate simultaneously
or, in contrast, mutated together at least once on at least one of the sequences of the set and
never mutated on the other sequences of the set are also found in other pathologies such as, for
example, pathologies in the field of cancer research.
[0058] It can be acknowledged that a large percentage of cancers are due to the presence 

of transposable elements that have a large degree of homology with the viruses, and that the 
hepatitis B virus is the second identified cause of cancer death after tobacco. 
[0059] Thus, among the genes implicated in human cancers, capable of having motifs 

that mutate and for which the set of sequences have sometimes been constituted, we can cite as 
examples the APC gene which has been essentially implicated in cancer of the colon (Nucleic 
Acids Res 1998, Jan 1; 26(1): 269-270, APC gene: database of germline and somatic mutations 
in human tumors and cell lines. Laurent-Puig P, Beroud C, Soussi T), the gene P53 (Nucleic 
Acids Res 1997, Jan 1; 25(1): 138, p53 and APC gene mutations: software and databases.
Beroud C, Soussi T), MEN-1 (A malignant gastrointestinal stromal tumor in a patient with 
multiple endocrine neoplasia type 1. Papillon E, Rolachon A, Calender A, Chabre O, Barnoud R, 
Fournet J), VHL (Mutations of the VHL gene in sporadic renal cell carcinoma: definition of a 
risk factor for VHL patients to develop an RCC. Gallou C, Joly D, Mejean A, Staroz F, Marin N, 
Tarlet G, Orfanelli MT, Bouvier R, Droz D, Chretien Y, Marechal JM, Richard S, Junien C, 
Beroud C), WT1 (Clin Cancer Res 2000, Oct; 6(10): 3957-65. WT1 splicing alterations in 
Wilms' tumors. Baudry D, Hamelin M, Cabanis MO, Fournet JC, Tournade MF, Sarnacki S, 
Junien C, Jeanpierre C). 

[0060] The invention also includes identifying motifs described above for selecting
fragments of sequences constituted by and/or comprising motifs that did not mutate
simultaneously for vaccines.

[0061] Vaccines are composed of antigens constituted by molecules or parts of molecules 

of a pathogenic organism which when they are injected in the organism enable production of a 
larger number of antibodies against the pathogenic organism. These antibodies recognize the 
molecules against which they are directed and thereby enable the immune system to destroy the 
pathogenic organism. 

[0062] There is a nonnegligible lapse of time, often many years, between the moment at which
the vaccine is defined and the moment at which it becomes available on the market. For example,
with regard to HIV, the low polymerization accuracy of the reverse transcriptase confers on the
virus a high degree of genomic variability which increases as a function of time. The viral
population is thus very heterogeneous. Destruction of the wild virus by the vaccine leads to the
selection of mutant viruses against which the vaccine remains ineffective.

[0063] The application of the method of the invention to subsets of variant sequences of the
protein sequences of pathogenic organisms makes it possible to trap the mutant virus:

- either it mutates but, in this case, it is no longer functional;

- or it does not mutate, but then the antibodies produced by the vaccine will be capable of
destroying it.

[0064] For example, with regard to HIV, the peptides which comprise the proteins of the 

virus envelope, identified because they do not mutate together, probably due to genetic pressure 
which would cause them to lose their functionality, are vaccine candidates of choice. 

[0065] In fact, the method for identifying peptide motifs makes it possible to select
sequences containing the motifs, contiguously or not, to prepare a candidate vaccine. The vaccine
has as an advantage, in relation to other vaccines developed by conventional means, that it is
described in exhaustive manner and contains certain regions necessary for the stability of the
vaccine, precisely through selection of the sequences that did not mutate simultaneously, leading
to the destruction of the pathogenic organism.

[0066] The identification of the motifs that did not mutate simultaneously is more 

complex for two main reasons: 

- the number of amino acids not mutating is about ten times larger, and 

- the combination of amino acids to be tested not being determined in advance, all of the 
combinations must be envisaged. 

[0067] The invention also pertains to the use of fragments of sequences constituted by 

and/or comprising nucleotide and/or peptide motifs of the analyzed sequences that did not mutate 
simultaneously for a vaccine. 

[0068] The invention also includes a method for identifying motifs or combinations of motifs
that did not mutate simultaneously to develop diagnostic tools. The invention further includes
the application of such an identification method to fragments of sequences constituted by and/or
comprising motifs having mutated simultaneously for diagnostic tests.

[0069] The method of the invention also makes it possible to construct a database which
constitutes a decision-making tool, for example, to help the physician determine which antiviral
therapies to administer to a given patient.

[0070] According to another aspect, the method for identifying motifs that did not mutate
simultaneously comprises a supplementary step comprising comparing data linking known drug
resistances to observed mutations, for example, in the case of HIV, to the data disclosed by J.
Hammond et al. in "Mutations in Retroviral Genes Associated with Drug Resistance" (The
Human Retroviruses and AIDS Compendium, 1999).

[0071] The drug-mutated amino acid relationship demonstrated in this manner is very useful
for improving treatment. For example, with regard to HIV, comparison of the peptide motifs is
performed on three subsets of a protein database, pertaining to reverse transcriptase, protease
and integrase (http://hiv-web.lanl.gov/).

[0072] The comparison of the sequences belonging to the subsets comprising from about 

300 to about 8000 sequences or fragments of the sequences of each of these three proteins 
enables application of the method of the invention to identify combinations of amino acids that 
did not mutate simultaneously. 

[0073] Thus, the method of the invention makes it possible to identify the mutations 

induced under the pressure of selection. 

[0074] The aspect of the invention comprising comparison with the drug resistances 

enables selection of a combination of drugs such that the amino acid mutations capable of being 
induced by each of the antiviral agents, capable of conferring resistance on the various drugs 
involved in this combination (fewer than ten), are not produced simultaneously. Identification of 
such motifs enables selection of a drug combination which disfavors the appearance of more than 
one mutation at a time, thereby closing the door to multiple resistances. The practitioner can 
then use the information obtained by applying this method, for example, to isolated viral 
sequences or viral sequences deduced from the isolated viral genome, of a given patient to ensure 
that the envisaged multi-drug therapy is in fact the most effective possible. With the 
identification of a first mutation excluding the two others, a selected three-agent therapy thereby 
enables the two remaining antiretroviral agents to continue to be effective. 
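
The selection logic of this paragraph can be sketched as follows. All data in the example are
hypothetical: the drug names, the resistance positions and the set of co-mutated position pairs
are invented for illustration and do not come from any resistance database. The sketch keeps a
drug combination only when no resistance position of one drug has been observed to mutate
simultaneously with a resistance position of another drug in the combination, and it also
rejects combinations in which two drugs share a resistance position (an assumption of this
sketch).

```python
from itertools import combinations


def compatible_regimens(resistance_positions, co_mutated, size=3):
    """Return drug combinations whose resistance positions never co-mutated.

    resistance_positions: dict mapping a drug name to a set of positions.
    co_mutated: set of frozensets {p, q} of positions seen mutated together.
    """
    regimens = []
    for combo in combinations(sorted(resistance_positions), size):
        conflict = any(
            p == q or frozenset((p, q)) in co_mutated
            for d1, d2 in combinations(combo, 2)
            for p in resistance_positions[d1]
            for q in resistance_positions[d2]
        )
        if not conflict:
            regimens.append(combo)
    return regimens


# Hypothetical inputs, for illustration only.
resistance = {"drugA": {10}, "drugB": {20}, "drugC": {30}, "drugD": {40}}
seen_together = {frozenset((10, 20))}

for regimen in compatible_regimens(resistance, seen_together):
    print(regimen)
```

In practice the set of co-mutated pairs would be derived from the matrix of unmutated couples
computed on the aligned variants, and the resistance positions from published resistance data.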

[0075] The aspect of identification of peptide regions not having mutated simultaneously
also provides valuable assistance in the case of the appearance of resistances in already treated
patients. The method according to the invention can, for example, be applied to the subsets of
polypeptide sequences among which is included that or those deduced from the sequencing of the
isolated viral genome of the patient. Thus, if this genotyping reveals a mutation responsible for
resistance, the method of identification of peptide motifs not having mutated allows
implementation of a multiple-therapy regimen designed to maintain the selection pressure on the
mutation. The molecule identified in this manner can be accompanied by two or three
antiretroviral agents which target domains of the protein not capable of mutating at the same time
as the zone that mutated.

[0076] Such a method is useful for the implementation of new antiretroviral 

combinations maximally preventing therapeutic escape. Thus, for example, identification of 
motifs within a given gene having mutated at least once simultaneously on at least one variant 
and not having mutated on other variants, enables identification of regions of the gene which 
could present a physical or functional interaction. In contrast, identification of motifs not having 
mutated simultaneously enables identification of regions of the gene whose mutual presence is 
essential and indispensable for its function. 

[0077] The invention also provides for identification, within a set of genes or a set of 

noncoding sequences, of motifs not having mutated simultaneously. Identification of such motifs 
enables selection of genetic regions that can have physical or functional interactions on the 
overall genome. 

[0078] Another aspect of the invention relates to a method for identifying motifs and 

combinations of motifs for selecting fragments constituted by and/or comprising motifs not 
having mutated simultaneously for the preparation of therapeutic targets. 
[0079] Still another aspect of the invention pertains to the use of fragments of sequences 

constituted by and/or comprising motifs either having mutated at least once on at least one 
sequence of the set and not having mutated on the other sequences of the set for the preparation 
of therapeutic targets. 

[0080] The invention also relates to the use of motifs or combinations of motifs identified in 

this manner for preparing therapeutic targets that are useful for screening new therapeutic 
compounds to prevent and/or treat human, animal or plant pathologies. Thus, once motifs not 
having mutated simultaneously, or sequence fragments containing them, have been identified, 
a therapeutic target can be prepared against which therapeutic compounds directed against the 
pathogenic organism are tested, especially therapeutic compounds against which the wild 
pathogenic organism cannot develop resistance mutations. 
[0081] The selection of fragments constituted by and/or comprising motifs not having 

mutated simultaneously is, thus, useful for the preparation of diagnostic tools, since it is not 
always easy to rapidly detect a certain type or subtype of pathogenic organism. The 
identification of peptide motifs according to aspects of the invention enables preparation of 
peptide fragments comprising the motifs most representative of a subtype of a pathogenic 
organism. These fragments are then used in detection tests such as, for example, immunoenzyme 
tests. 

[0082] This application of the invention comprises identifying a set of motifs 

indispensable for the function of a protein of a human, animal or plant organism or of a 
pathogenic organism. These motifs can constitute, for example, a subset of amino acids known 
to play an important role in the function of the targeted protein. The motifs identified in this 
manner are advantageously contiguous motifs of the genetic sequence and represent a linear 
sequence of the gene. Alternatively, the motifs identified are advantageously motifs noncontiguous on the 
linear sequence of the gene. They can then be useful for completing three-dimensional analysis 
studies to confirm a possible nonlinear spatial proximity of the motifs. The method of the 
invention can then include a new supplementary step (g) after the step (e) of identification of the 
motifs, the step comprising comparing the motifs with the three-dimensional structural data of 
these proteins such as the amino acids involved in the catalytic site and/or in the sites bound by 
noncompetitive inhibitors. This latter comparison produces a list of amino acids involved in the 
protein function and not having mutated together. 

[0083] The invention also uses fragments of sequences constituted by and/or comprising 

peptide motifs having mutated simultaneously for the development of diagnostic tools. The 
method for the identification of peptide regions defines the most representative peptides of a 
subtype. Once they are identified, these peptides are used in detection tests known in the art, 
such as, for example, immunoenzyme tests of the ELISA type. 

[0084] The search for peptides representing a subtype of a particular type is performed as 

indicated above. It is a question of finding peptide antigens capable of being recognized by a 
particular serum containing or not containing the antibodies of a particular subtype. The method 
according to the invention can be applied to any databank of sequences. The results are 
compared by subtype, and the theoretical peptide combination most representative of a 
particular pathogenic type is thereby identified. The peptides identified in this manner are 
synthesized and tested immunologically against a collection of serums. 

[0085] The invention is especially valuable when it is used to identify, from a large number of 

sequences comprising a large number of motifs, either motifs having mutated once together or 
motifs not having mutated, in order to select the sequences of motifs useful for the 
various applications envisaged above. 
[0086] To illustrate the method for the identification of motifs of the invention, the 

example below shows the different matrices constituted in a comparison of motifs performed on 
a subset of eight sequences based on the reference sequence SVRLGHKDEV. 

                                  POSITIONS
                                  0123456789
Reference sequence (consensus)    SVRLGHKDEV

Subset of sequences               Alignment
Seq 1                             SRRLGHKDEV
Seq 2                             SVRLGHKLEV
Seq 3                             SRDLGHKDEV
Seq 4                             SVRLGHLDVV
Seq 5                             SVDLGHKTEV
Seq 6                             SKRLGHKDEV
Seq 7                             SVRLGHGDGV
Seq 8                             SVRLGHKSEV



1. MUTATION MATRIX A 

[0087] Attributed values: 

A1 = 0, if motif mutated in relation to the reference sequence 

A2 = 1, in all other cases (motif not mutated in relation to the reference sequence). 

         POSITION
         0123456789
Seq 1    1011111111
Seq 2    1111111011
Seq 3    1001111111
Seq 4    1111110101
Seq 5    1101111011
Seq 6    1011111111
Seq 7    1111110101
Seq 8    1111111011
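
The construction of mutation matrix A can be sketched in a few lines. This is a minimal illustration, assuming each motif is a single aligned position and that every sequence has the same length as the reference; the function name `mutation_matrix` is chosen here for illustration and does not come from the described method.

```python
REFERENCE = "SVRLGHKDEV"
SEQUENCES = [
    "SRRLGHKDEV",  # Seq 1
    "SVRLGHKLEV",  # Seq 2
    "SRDLGHKDEV",  # Seq 3
    "SVRLGHLDVV",  # Seq 4
    "SVDLGHKTEV",  # Seq 5
    "SKRLGHKDEV",  # Seq 6
    "SVRLGHGDGV",  # Seq 7
    "SVRLGHKSEV",  # Seq 8
]


def mutation_matrix(reference, sequences):
    """Return matrix A: 0 where a position differs from the reference, else 1."""
    matrix = []
    for seq in sequences:
        if len(seq) != len(reference):
            raise ValueError("sequences must be aligned to the reference")
        matrix.append([0 if s != r else 1 for s, r in zip(seq, reference)])
    return matrix


A = mutation_matrix(REFERENCE, SEQUENCES)
for number, row in enumerate(A, start=1):
    print(f"Seq {number}    {''.join(map(str, row))}")
```

Each printed row corresponds to one line of the matrix, for example `1011111111` for Seq 1, whose only difference from the reference is at position 1.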



2. NONMUTATED MATRIX B 

[0088] Attributed values: 

B1 = 0, if couple of motifs mutated simultaneously 

B2 = 1, in all other cases (couple of motifs never having mutated 
simultaneously) 

        POSITION
        0123456789
POS0    1111111111
POS1    1001111111
POS2    1001111011
POS3    1111111111
POS4    1111111111
POS5    1111111111
POS6    1111110101
POS7    1111111011
POS8    1111110101
POS9    1111111111
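
One way to derive the nonmutated matrix B from the rows of mutation matrix A is sketched below. It assumes, as the diagonal of the printed matrix suggests, that a mutated position is also paired with itself; the function name `nonmutated_matrix` is illustrative. Because the rule is applied to both orderings of a pair, the sketch also yields a 0 at row POS7, column 2 (from Seq 5), an entry the printed table shows as 1.

```python
# Rows of mutation matrix A (0 = mutated, 1 = not mutated), Seq 1 to Seq 8.
A_ROWS = [
    "1011111111", "1111111011", "1001111111", "1111110101",
    "1101111011", "1011111111", "1111110101", "1111111011",
]


def nonmutated_matrix(a_rows):
    """Return matrix B: 0 if two positions mutated in the same sequence, else 1."""
    size = len(a_rows[0])
    b = [[1] * size for _ in range(size)]
    for row in a_rows:
        mutated = [pos for pos, value in enumerate(row) if value == "0"]
        # Every pair of positions mutated in this sequence mutated simultaneously.
        for p in mutated:
            for q in mutated:
                b[p][q] = 0
    return b


B = nonmutated_matrix(A_ROWS)
for pos, row in enumerate(B):
    print(f"POS{pos}    {''.join(map(str, row))}")
```

Reading a row of B then gives, for one position, every position with which it has never mutated simultaneously in the subset.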



3. MUTATED MATRIX C 

[0089] Attributed values: 

C1 = 1, if couple of motifs mutated simultaneously or never mutated, 

C2 = 0, other cases. 

        POSITION
        0123456789
POS0    0000000000
POS1    0000000000
POS2    0000000000
POS3    0000000000
POS4    0000000000
POS5    0000000000
POS6    0000000010
POS7    0000000000
POS8    0000001000
POS9    0000000000



[0090] The interrogation of the mutated matrix C thus makes it possible to identify the 

motifs in positions 6 and 8 as motifs having mutated at least once together. 
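
The interrogation of matrix C can be sketched as follows. The rule coded here is one reading of the attributed values that reproduces the printed matrix, and is an assumption: a couple of distinct positions receives 1 when the two positions mutated together in at least one sequence and, in every other sequence, neither of them mutated. The function name `mutated_matrix` is illustrative.

```python
# Rows of mutation matrix A (0 = mutated, 1 = not mutated), Seq 1 to Seq 8.
A_ROWS = [
    "1011111111", "1111111011", "1001111111", "1111110101",
    "1101111011", "1011111111", "1111110101", "1111111011",
]


def mutated_matrix(a_rows):
    """Return matrix C under the assumed rule described in the lead-in."""
    size = len(a_rows[0])
    c = [[0] * size for _ in range(size)]
    for p in range(size):
        for q in range(size):
            if p == q:
                continue  # the printed matrix leaves the diagonal at 0
            together = any(r[p] == "0" and r[q] == "0" for r in a_rows)
            never_apart = all(r[p] == r[q] for r in a_rows)
            if together and never_apart:
                c[p][q] = 1
    return c


C = mutated_matrix(A_ROWS)
pairs = [(p, q) for p in range(len(C)) for q in range(p + 1, len(C)) if C[p][q]]
print(pairs)  # [(6, 8)]
```

Positions 1 and 2 are excluded under this rule because position 1 also mutated alone (Seq 1 and Seq 6), and positions 2 and 7 are excluded because position 7 also mutated alone (Seq 2 and Seq 8).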